Skip to content
This repository was archived by the owner on Feb 27, 2025. It is now read-only.

Replace overall_safety_task#4081

Open
ss-es wants to merge 45 commits into
mainfrom
ss/rewrite-overall-safety-task
Open

Replace overall_safety_task#4081
ss-es wants to merge 45 commits into
mainfrom
ss/rewrite-overall-safety-task

Conversation

@ss-es
Copy link
Copy Markdown
Contributor

@ss-es ss-es commented Jan 24, 2025

This PR:

  • Removes the overall_safety_task, replacing it with additional checks in consistency_task
    • consistency_task runs checks after every decide event it receives, terminating the test if those checks fail
    • the task now enforces expectations for failed and successful views.
    • the logic is dramatically simplified: views are successful if we receive a decide event, and fail if we receive a decide for a subsequent view
    • consistency_task now times out if we take too long (currently 4 seconds) between external events (e.g. decides)
  • num_failed_views is completely removed
    • views that always fail are now declared with expected_view_failures: vec![]
    • views that may fail or succeed are declared with possible_view_failures: vec![]

This PR does not:

No checks are added to validate per-node view failures. These should be added in a future PR

Key places to review:

The added checks in consistency_task.rs

@ss-es ss-es changed the base branch from main to ss/consistency-task-fix January 24, 2025 22:09
Base automatically changed from ss/consistency-task-fix to main January 27, 2025 18:40
@ss-es ss-es marked this pull request as ready for review February 4, 2025 06:05
@ss-es ss-es requested a review from bfish713 as a code owner February 4, 2025 06:05
Copy link
Copy Markdown
Collaborator

@bfish713 bfish713 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't thoroughly check the tests. Main comment is about how we spawn/cancel the timeout in the consistency_task

Comment thread crates/testing/src/consistency_task.rs Outdated
result.insert(*view, leaf.1.clone());
}

for (parent, child) in result.values().zip(result.values().skip(1)) {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: isn't this just leaf_pairs defined above

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't see a leaf_pairs in this scope, though I noticed the check was duplicated in a git merge (fixed)

Comment thread crates/testing/src/consistency_task.rs Outdated
/// Handles an event from one of multiple receivers.
async fn handle_event(&mut self, (message, id): (Self::Event, usize)) -> Result<()> {
{
let mut timeout_task = spawn_timeout_task(
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does a timeout event here cancel timeout_task?

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't we move this into the the if block just below, that way we are testing that we can get decides within the timeout (the event we care about)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah timeouts (any event) cancels it, I did this intentionally at first to avoid having to manually increase the timeout on tests with a string of timeouts (e.g. staggered restart)

but now I think I agree with you that it should be moved. I'll make the change

Ok(TestProgress::Incomplete)
}
}
pub async fn check_view_success(&self) -> Result<()> {
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This checking that we didn't decide a view that we think should fail? can't we just check this super simply by looking at the leaf.view rather than getting it back out of our map

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes you're right! this was just copy-paste

will fix

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants